Built with R version 4.4.0.

1 Introduction

The point of this document is to make open and share-able the research methods used for my dissertation on nearshore fish communities in Alaska working towards a PhD in marine biology at the University of Alaska Fairbanks. Here, I cover most steps of the initial data preparation for my second and third chapters concerning spatial and temporal distributions of nearshore fishes across the state. This and other project files can be accessed via the Kachemak Bay National Estuarine Research Reserve’s github nearshore fish repository. Following this document, next step exploratory data analyses referencing these data can be found at this Rpubs site.

In this document I show the steps taken in wrangling the NOAA Nearshore Fish Atlas (NFA) database. This replaces the separate scripts (named “./FishAtlas_#_description.R”) and combines them into a single, streamlined R markdown file. These R scripts will be archived in case I need to replicate older methods. Namely, I used to use rgdal package for some steps in the wrangling process but that package stopped being supported ca. Oct 2023, moving my workflow to the sf package.

Link to the NFA can be found here, https://alaskafisheries.noaa.gov/mapping/sz/. The entire database can be downloaded as a .csv file, which is what I am using. However, I have additional data files that were either shared with me by NFA content managers or modified from the NOAA NFA data directory for ease of reading into R. I will try to simplify this for others to replicate in the future.

2 Wrangling site data

The idea with this initial step is to define the scale of our samples. Contributors to the NFA did not define their SiteID’s and EventID’s similar to each other, so we’ll need to distinguish between SiteID vs EventID and make sure it’s consistent throughout the data.

I define a SiteID as a unique place in space, i.e., all samples from the same ‘beach’ (‘beach’ will be defined later on). An EventID is a unique pull of the seine. So there can be single or multiple EventID’s associated with a SiteID. Additionally, I introduce a VisitID, which adds a temporal distinction among SiteID’s, i.e., all events from the same beach on the same day.

2.1 Set up

Load required packages, define directory, set options, source scripts:

# Packages
library(tidyverse)
library(lubridate)
library(here)
library(skimr)
library(sf)
library(DT)

# Directory
wd = here()
dirs = wd %>% list.files() %>% str_subset(pattern = "^README|^LICENSE|.md$|.Rproj$", negate = TRUE)
for (i in seq_along(dirs)) {
  name = str_replace_all(dirs[i], "^", "dir.")
  path = str_replace_all(dirs[i], "^", str_c(wd, "/"))
  assign(name, path)
  rm(name, path, i)
}

# Options

# Source

2.2 Read in data

data = read_csv(file.path(dir.data, "Raw Fish Atlas Sites 2024-10-16.csv"),
               show_col_types = FALSE)
GearLookup = read_csv(file.path(dir.data, "FishAtlasExpansion_040422_Lookup_Gear.csv"),
                      show_col_types = FALSE)

2.3 Splitting site and species info

As in typical community data (site x species), we have info tied to site description and info tied to taxa. Notice that EventID is the ‘look-up’ key used in both data sets.

events.1 = data %>%
  select(SiteID, EventID,
         Date,
         Lat, Lon = Long, Region, Location,
         Habitat, TidalStage, Temp_C, Salinity,
         GearSpecific, ProjectName, PointOfContact) %>%
  mutate(Date = mdy(Date)) %>% # re-format date
  distinct()
glimpse(events.1)
## Rows: 4,473
## Columns: 14
## $ SiteID         <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4…
## $ EventID        <dbl> 15, 60, 9, 89, 433, 634, 635, 767, 90, 12, 61, 16, 17, …
## $ Date           <date> 2001-07-24, 2002-03-25, 2001-04-26, 2002-07-24, 2003-0…
## $ Lat            <dbl> 58.57520, 58.57520, 58.57520, 58.57520, 58.57520, 58.57…
## $ Lon            <dbl> -134.9263, -134.9263, -134.9263, -134.9263, -134.9263, …
## $ Region         <chr> "Gulf of Alaska", "Gulf of Alaska", "Gulf of Alaska", "…
## $ Location       <chr> "southeastern Alaska, Lynn Canal, North Island", "south…
## $ Habitat        <chr> "Kelp", "Kelp", "Kelp", "Kelp", "Kelp", "Kelp", "Kelp",…
## $ TidalStage     <chr> "LOW", "LOW", "LOW", "LOW", "LOW", "LOW", "LOW", "LOW",…
## $ Temp_C         <dbl> 13.0, 4.0, 6.0, 13.0, 5.0, 12.0, 12.0, 3.7, 13.0, 6.0, …
## $ Salinity       <dbl> 21, 33, 33, 18, 31, 16, 16, 33, 18, 33, 33, 21, 21, 33,…
## $ GearSpecific   <chr> "BSEINE-STD", "BSEINE-STD", "BSEINE-STD", "BSEINE-STD",…
## $ ProjectName    <chr> "SYNTHESIS", "SYNTHESIS", "SYNTHESIS", "SYNTHESIS", "SY…
## $ PointOfContact <chr> "NMFS, Alaska Fisheries Science Center, Auke Bay Labora…
catch.1 = data %>%
  select(EventID,
         Sp_CommonName, Sp_ScientificName, Fam_CommonName, Fam_ScientificName,
         Unmeasured, Length_mm, LengthType, LifeStage)
glimpse(catch.1)
## Rows: 161,621
## Columns: 9
## $ EventID            <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,…
## $ Sp_CommonName      <chr> "Coho salmon", "Great sculpin", "Great sculpin", "P…
## $ Sp_ScientificName  <chr> "Oncorhynchus kisutch", "Myoxocephalus polyacanthoc…
## $ Fam_CommonName     <chr> "trouts and salmons", "sculpins", "sculpins", "cods…
## $ Fam_ScientificName <chr> "Salmonidae", "Cottidae", "Cottidae", "Gadidae", "P…
## $ Unmeasured         <dbl> 0, 0, 0, 0, 0, 10, 3, 1, 4, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Length_mm          <dbl> 157, 208, 167, 81, 153, NA, NA, NA, NA, 66, 50, 53,…
## $ LengthType         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LifeStage          <chr> "JUV", "JUV", "JUV", "YOY", "JUV", "JUV", NA, NA, N…

2.4 Considering gear type

The NFA website makes it easy to filter for gear type, which I did prior to downloading the .csv file. However, we’ll want to account for differences among the beach seines used- namely the net mesh size. This information is contained in the metadata for the NFA database and can be accessed by request from the database managers (file FishAtlasExpansion_040422_Lookup_Gear.csv). A few instances of missing mesh size information was provided by email request to the associated point of contact for that contributed data. Results from the below code are hidden but comments explain what decisions were made.

# Join the mesh size info from our gear data to the site information
gear.1 = left_join(events.1, select(GearLookup, GearSpecific, MeshSize), by = "GearSpecific")

# Subset for gear related info and take a look
gear.2 = select(gear.1, EventID, MeshSize, GearSpecific, ProjectName, PointOfContact)
glimpse(gear.2)
# How many different mesh sizes are there?
gear.2$MeshSize %>% unique()

# Looks like we have an NA to take care of...
filter(gear.2, is.na(MeshSize))
# These are all from the GOAIERP project (PI Olav Ormseth). The report from that project reports 6 mm stretched mesh.
# We'll just go ahead and fix it here instead of changing the GearLookup file,
gear.3 = mutate(gear.2, MeshSize = ifelse(is.na(MeshSize), 6, MeshSize))

# Check our mesh sizes again,
gear.3$MeshSize %>% unique()
# Great, no more NA's.

# We can now join this information with our site data,
events.2 = left_join(events.1, select(gear.3, EventID, MeshSize), by = "EventID")

2.5 The “Site-Event” problem

The catch data is tied to unique EventID’s which are associated with spatially explicit SiteID’s (each SiteID has a unique lat/lon). Considering that the data has been contributed by multiple researchers/projects, we’ll need to make sure we have consistent sites and events across the whole dataset. After that we can tackle the catch data. First, we take a look at all of the site data with skimr::skim().

skim(events.2)
Data summary
Name events.2
Number of rows 4473
Number of columns 15
_______________________
Column type frequency:
character 7
Date 1
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Region 0 1.00 10 16 0 5 0
Location 1 1.00 8 135 0 406 0
Habitat 0 1.00 4 13 0 8 0
TidalStage 1007 0.77 3 5 0 6 0
GearSpecific 0 1.00 8 12 0 15 0
ProjectName 0 1.00 4 31 0 31 0
PointOfContact 0 1.00 91 151 0 11 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 1976-05-29 2021-09-09 2006-04-25 1035

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SiteID 0 1.00 1173.39 597.51 1.00 725.00 1091.00 1664.00 2505.00 ▅▃▇▇▁
EventID 0 1.00 6345.33 3081.50 1.00 3613.00 6988.00 9176.00 10340.00 ▃▃▁▇▇
Lat 0 1.00 59.66 3.58 51.64 58.33 59.40 59.60 71.39 ▁▇▃▁▁
Lon 0 1.00 -145.21 9.86 -176.95 -151.63 -150.22 -134.95 -130.99 ▁▁▇▁▇
Temp_C 1966 0.56 9.97 3.45 -0.92 7.40 10.50 12.50 28.70 ▂▇▇▁▁
Salinity 1988 0.56 24.07 14.20 -400.00 19.80 25.11 30.10 73.50 ▁▁▁▁▇
MeshSize 0 1.00 5.36 3.52 3.00 3.00 3.20 6.00 12.70 ▇▂▁▁▂

In particular we want to get a sense of how EventIDs are associated with SiteIDs and other variables placing them in space and time.

We see a lot of single-pull samples when events are grouped by SiteID and Date, i.e., a visit. However, we can’t assume all of those ‘visits’ are comparable in scope. Projects differed in structure and purpose, so what one researcher would consider a new site, another researcher might call a replicate event to be combined with other events within a single sample. I call this the “Site-Event” problem.

We see that the distribution of events becomes less severe when grouped by Date and Location, which tells me that some of those single-pull samples actually should be aggregated to be comparable to other multi-pull samples. However, grouping by location may not be accurate because of the loose definition of ‘Location’ (the levels of Location also seem to vary in scale). Since location info is tied to SiteID, we can’t just look at EventID’s and lat/lon. Instead, we’ll need to figure out a way to combine events at the SiteID level. Basically, we are trying to combine SiteID’s that are currently unique but should have the same identifier, i.e., samples of the same beach.

2.6 Solving the “Site-Event” problem

Here is where we need to decide on how to define a ‘beach’. In other words, what do we consider the spatial scale distinguishing one sample from another? From my personal seining, one of my larger-distanced sites was near Anchor Point, AK, where we’ve seined up to 800 m between first and last replicate. So in my opinion, 1 km would be reasonably conservative estimate to link replicate seines. Imagine seining by skiff- you could hop between replicates ~1 km down beach and still consider it the same site. In the below code, I define clusters of SiteID’s by grouping points that are within 1000 m of at least one other SiteID.

# Make a new tibble containing SiteID geometries:
sites.sf = events.2 %>%
  select(SiteID, Lat, Lon) %>%
  # Create sf object from lat/lon:
  st_as_sf(., coords = c("Lon", "Lat"), crs = 4326) %>% # WGS84
  # Project into Alaska Albers 3338
  st_transform(crs = 3338) %>%
  distinct()

# Define a buffer of 1km around each point
sites.sf.buf = st_buffer(sites.sf, dist = 1000)

# Create a tibble with clusters of overlapping buffer areas 
sites.sf.cl = sites.sf.buf %>%
  st_union() %>% # unite buffer areas
  st_cast(to = "POLYGON") %>% # turn them into polygon features
  st_as_sf() %>% # return geometry set features to sfc
  rownames_to_column(var = "Cluster") %>% # create col for cluster groups
  mutate(Cluster = as.factor(Cluster)) # use factor class

# Assign SiteID points to buffer area clusters
sites.sf.cl.join = st_join(sites.sf, sites.sf.cl, left = TRUE)

# Drop sf class from joined data
sites.cl = st_drop_geometry(sites.sf.cl.join)

# See what we got:
glimpse(sites.cl)
## Rows: 1,178
## Columns: 2
## $ SiteID  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ Cluster <fct> 327, 327, 327, 327, 327, 326, 326, 326, 326, 327, 327, 327, 32…

We now have a simple classification variable that can be joined back with the site information.

events.3 = left_join(events.2, sites.cl, by = "SiteID")

However, this only takes into account the spatial component of our larger goal of creating a unique VisitID identifier. Next, we’ll account for the sample dates where SiteID’s within a cluster match.

# Create a date-cluster tibble containing location and event info to be combined,
orig = select(events.3, Date, Cluster, SiteID, EventID, Lat, Lon)

# Nest the df by Date and Cluster,
orig.nest = orig %>%
  group_by(Date, Cluster) %>%
  nest(Sites = SiteID,
       Events = EventID,
       Lats = Lat,
       Lons = Lon) %>%
  ungroup()

We essentially created a our new sample identifier (visit) in nesting by date (day) and cluster (beach). To bring the data back to the EventID level, we need to deal with samples that now have multiple SiteID’s and locations, as well as preserve the EventID’s associated with the original SiteID. This means we will be overwriting data, which is why I called the above subset ‘orig’ and the below ‘new’.

new = orig.nest %>% 
  rowwise() %>% # Operates a row-at-a-time (per sample)
  mutate(SiteID.cl = min(Sites), 
         Lat.mean = mean(Lats$Lat),
         Lon.mean = mean(Lons$Lon)) %>%
  select(-c(Sites, Lats, Lons)) %>% # Remove original info in nested variables
  unite(col = VisitID, SiteID.cl, Date, sep = "_", remove = FALSE) %>% # create VisitID
  select(-c(Date, SiteID.cl, Cluster)) %>% # Remove info now captured by VisitID
  unnest_longer(Events) %>% # Unnest to the EventID level
  unpack(cols = Events) # change tibble back to vector

# Join our new identifier to the rest of the site data,
events = left_join(new, events.3, by = "EventID") %>%
  mutate(Lat = Lat.mean, Lon = Lon.mean) %>% # Replace old lat/lon with new mean lat/lon
  select(-c(Lat.mean, Lon.mean))

# Check our our new data structure
head(events, 12)
## # A tibble: 12 × 17
##    VisitID      EventID SiteID Date         Lat   Lon Region    Location Habitat
##    <chr>          <dbl>  <dbl> <date>     <dbl> <dbl> <chr>     <chr>    <chr>  
##  1 1_2001-07-24      15      1 2001-07-24  58.6 -135. Gulf of … southea… Kelp   
##  2 1_2001-07-24      16      2 2001-07-24  58.6 -135. Gulf of … southea… Kelp   
##  3 1_2001-07-24      17      3 2001-07-24  58.6 -135. Gulf of … southea… Kelp   
##  4 1_2001-07-24      18      4 2001-07-24  58.6 -135. Gulf of … southea… Sand-G…
##  5 1_2001-07-24      19      5 2001-07-24  58.6 -135. Gulf of … southea… Bedrock
##  6 1_2002-03-25      60      1 2002-03-25  58.6 -135. Gulf of … southea… Kelp   
##  7 1_2002-03-25      61      2 2002-03-25  58.6 -135. Gulf of … southea… Kelp   
##  8 1_2002-03-25      62      3 2002-03-25  58.6 -135. Gulf of … southea… Kelp   
##  9 1_2002-03-25      63      4 2002-03-25  58.6 -135. Gulf of … southea… Sand-G…
## 10 1_2002-03-25      64      5 2002-03-25  58.6 -135. Gulf of … southea… Bedrock
## 11 1_2002-03-25      69     10 2002-03-25  58.6 -135. Gulf of … southea… Sand-G…
## 12 1_2002-03-25      70     11 2002-03-25  58.6 -135. Gulf of … southea… Sand-G…
## # ℹ 8 more variables: TidalStage <chr>, Temp_C <dbl>, Salinity <dbl>,
## #   GearSpecific <chr>, ProjectName <chr>, PointOfContact <chr>,
## #   MeshSize <dbl>, Cluster <fct>

We can see in first dozen rows how the locations now structured by VisitID compares to the original based on SiteID. To select an identifier for the new VisitID, I simply went with the lowest SiteID of the nest list (there were no wrong answers here). The lat/lon’s to be combined are at most a handful kms apart, so I figure a simple mean would suffice. Notice we have unique lat/lon’s at the VisitID level- this is because different visits to the same beach might have had a different number of replicates. For example, we see that VisitID 1_2001-07-24 involved the first five events (EventID’s 15-19), whereas, VisitID 1_2002-03-25 had an additional two events.

OK, great. Lastly, let’s return to our frequency plots showing the number events per sample:

Wow. There are way less single-pull samples! Also, there are some interesting samples with replicates of 9, 10, and 12. Some stats on the number of events per visit:

  • mean = 2.53

  • median = 2

  • sd = 1.76

2.7 Clean up

# Clean up our environment for the next step:
rm(list = ls()[!ls() %in% c('events', 'catch.1', 'data')])

# Reset directory
wd = here()
dirs = wd %>% list.files() %>% str_subset(pattern = "^README|^LICENSE|.md$|.Rproj$", negate = TRUE)
for (i in seq_along(dirs)) {
  name = str_replace_all(dirs[i], "^", "dir.")
  path = str_replace_all(dirs[i], "^", str_c(wd, "/"))
  assign(name, path)
  rm(path, i)
}

3 Wrangling catch data

Similar to our wrangle of site information, we want to hone the breadth of the catch data we use. At this stage, we don’t want to make any interpretations about the catch. Instead, the goal here is to weed out objectively unneeded and suspicious cases and ensure the consistency of the data (e.g., are all possible species named actual names?).

3.1 Skimming catch

skim(catch.1)
Data summary
Name catch.1
Number of rows 161621
Number of columns 9
_______________________
Column type frequency:
character 6
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sp_CommonName 0 1.00 7 31 0 195 0
Sp_ScientificName 0 1.00 7 38 0 190 0
Fam_CommonName 9 1.00 4 18 0 34 0
Fam_ScientificName 9 1.00 7 18 0 34 0
LengthType 77836 0.52 2 6 0 7 0
LifeStage 90497 0.44 3 8 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
EventID 0 1.00 6017.42 3244.78 1 3199 6618 9192 10340 ▂▆▁▃▇
Unmeasured 0 1.00 12.80 512.01 0 0 0 0 99997 ▇▁▁▁▁
Length_mm 11535 0.93 98.33 74.16 4 48 76 124 930 ▇▁▁▁▁

One nice thing about skim() is that it quickly tells you where the information is missing and how much is missing. We see that there are nine cases of missing data at the family level. Something else that jumps out to me is that the number of unique species common names is not the same as the number of unique species scientific names.

I’ll also change the ‘Unmeasured’ variable to a count abundance, which may not be totally necessary but will make it easier for me to work with the data.

Another thing to point out is that less than half of the data appears to have life stage information. Furthermore, the data with length information is a little over 50%. We expect some missing lengths because most projects will measure size for a subset of each species and count the rest. Typically, I expect 20-50 individuals sized before a row indicated the unmeasured count, which would result in a very low percentage of “missing” lengths (maybe 5% or less?). But missing lengths over 50% indicate some projects just didn’t measure their catch, which is a shame because we could have inferred life stage from length. At this point, it doesn’t seem worth while to toss out potentially half the data to include either of those variables. Although, I may return to this down the road.

First, let’s check out all the taxa we have in the data.

3.2 Taxa list - before edits

Family

Species

3.3 Juveniles and unidentified fish

Juveniles

Unidentified

Both

catch.1 %>%
  filter(Sp_CommonName %>% str_detect("juvenile|unidentified")) %>%
  select(Fam_ScientificName, Fam_CommonName, Sp_ScientificName, Sp_CommonName) %>%
  distinct() %>%
  arrange(Sp_CommonName) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))

3.4 Fixing naming conventions

There is a frustrating mix of so called juveniles that are labelled YOY or Larval or YOY-Larval with inconsistent binning of lengths. For example, we have observations of juvenile sculpin at 9 mm with both YOY and larval labels in the data. The largest juvenile is a greenling at 460 mm. The next largest juvenile is a sculpin at 298 mm. Confusingly, there are also three cases of juvenile sculpin/salmonid that are labelled with adult life stage. Besides that, where fish are unmeasured and no life stage designated, we can probably trust that they are actually juveniles.

When we look at the observations that were labelled ‘unidentified’, we find some more issues to address. Unidentified cases usually come about because fishes were either too small to distinguish in the field (e.g., YOY morphologies), were unfamiliar to the sampler (e.g., rare species), or were taxa that are generally difficult to distinguish based on morphology (e.g., salmonids or cisco). The smallest unidentified fish with an adult life stage label is a 275 mm cisco “Coregonus species”. The largest unidentified fish labelled larval is a 51 mm “Osmeridae or Clupeidae”. This brings up an in naming convention of the unidentified fishes. For example, we have a general ‘unidentified fish’ but also ‘unidentified fish larvae’ and ‘unidentified larval forage fish’. Which brings up another issue that fish are unidentified to differing taxonomic levels- sometimes genus or family or sub-family.

It doesn’t make sense to me to have a juvenile distinction among species names nor have so many unique names among unidentified fish. The simplest fix in my opinion is just to be sure to use scientific names in analyses. We’ll have to address the two cases using “or” in Sp_ScientificName first. I also think we should then remove all unidentified instances that are simply “Division Teleostei” because it is not informative for basic questions of taxonomy. After that, we may wish to aggregate certain groups to higher taxonomy based on how frequent or rarely they occur in the data. So we should create a genus class to preserve that information in case we decide to move to the family level later.

I do want to retain the ’Division Teleostei” data as a subset because I think it could be useful when asking certain questions. I also think we could return to the idea of categorizing species based on life stage by inferring missing information. It would be interesting to examine whether we could predict what those fishes might be based on co-occurring species, time of year, location, etc.

I do not show results of the below code, but I provide an edited taxa list and see also comments for steps taken.

# Fix the "or" cases in Sp_ScientificName
catch.1a = catch.1 %>%
  # Find the 'Anisarchus or Lumpenus' species and replace with Fam_ScientificName
  mutate(Sp_ScientificName = ifelse(str_detect(Sp_ScientificName, pattern = "Anisarchus medius or Lumpenus fabricii"), Fam_ScientificName, Sp_ScientificName),
         # Replace the missing family info for "Osmeridae or Clupidae" with "Division Teloestei" and "bony fishes"
         Fam_CommonName = ifelse(is.na(Fam_CommonName), "bony fishes", Fam_CommonName),
         Fam_ScientificName = ifelse(is.na(Fam_ScientificName), "Division Teleostei", Fam_ScientificName),
         # Create genus class by taking first word of Sp_ScientificName
         Gen_ScientificName = word(Sp_ScientificName, 1))

# Remove all instances of "Division Teleostei" and save the df
catch.2 = filter(catch.1a, Fam_ScientificName != "Division Teleostei")

# Store the removed fishes as .rds in our data folder
filter(catch.1a, Fam_ScientificName == "Division Teleostei") %>%
  saveRDS(., file = file.path(dir.data, "nfa_teleostei.rds"))

3.5 Taxa list - after edits

Family

Genus

Species

3.6 Replacing ‘Unmeasured’ with ‘Count’

The below code changes the ‘Unmeasured’ parameter into ‘Count’ abundance:

catch.3 = mutate(catch.2, Count = ifelse(Unmeasured == 0, 1, Unmeasured)) %>%
  select(-Unmeasured) # remove unused parameter

Next, I’ll make graphs and tables describing the abundance and occurrence of our catch:

3.7 Visualizing catch

Family

Genus

Species

3.8 Defining rare taxa

Removing rare species is common when dealing with community data because of the affect that the most rare species would have in typical analyses to be conducted (e.g., analysis of variance). Depending on our research questions, we may not consider rare cases functional members of the community- in these cases it may make more sense to view the community as those taxa most well represented. But if I were more interested in richness, then I would want to conservatively retain rare taxa.

I think we can convince ourselves that removing single count and even single occurence species would be reasonable, because we do not have any information to let us know whether those cases represent an actual community member or whether they represent some sort of sampling error (e.g., mistakenly labelling all the salmonids in a particular sample Atlantic Salmon rather than a Pacific). Our research questions broadly deal with spatial and temporal distributions of community structure, so I think a ‘middle of the road’ approach is how we’ll want to proceed (at least for now).

We’ll want to be objective in how we define rare taxa. To me, it makes sense to consider occurrence rather than abundance. Our decision should balance the information lost vs retained, and consider how it affects each level of taxonomy. However, I do not know what taxa level makes most sense for future analyses at this point so we’ll want to keep our options open. For now, let’s consider it “rare” when taxa appear in less than 5 total visits (less than 0.25% percent occurrence). With that, removing rare taxa causes:

  • 17.6% reduction in richness at the Family level (6 of 34)

  • 25.6% reduction in richness at the Genus level (30 of 117)

  • 31.4% reduction in richness at the Species Level (59 of 188)

Note that this fixed cut-off affects species diversity more so than family diversity. Some of this can be explained by the fact that the species list contains many identifications at the genus and family level, which makes sense since identification often includes cases where researchers aggregate ‘iffy’ ID’s into a higher class. I expect this issue to be especially apparent in a collaborative dataset like the NFA.

3.9 Lists of rare taxa

Family

Genus

Species

Cool- now let’s re-visit our abundance and occurrence graphs to see how they look:

3.10 Catch after removing rare taxa

Family

Genus

Species

Last thing we’ll do is rename our catch tibble and clean up our environment.

# Re-assign working version of catch to a cleaner name
catch = catch.3

3.11 Clean up

# Clean up our environment for the next step:
rm(list = ls()[!ls() %in% c('data', 'events', 'catch', 'catch.fam', 'catch.gen', 'catch.sp', 'rare.families', 'rare.genus', 'rare.species')])

# Reset directory
wd = here()
dirs = wd %>% list.files() %>% str_subset(pattern = "^README|^LICENSE|.md$|.Rproj$", negate = TRUE)
for (i in seq_along(dirs)) {
  name = str_replace_all(dirs[i], "^", "dir.")
  path = str_replace_all(dirs[i], "^", str_c(wd, "/"))
  assign(name, path)
  rm(path, i)
}

4 Wrangling visit data

4.1 Set up

Remember that our samples operate at the visit level, where unique events represent replicates within each sample. Let’s create a new tibble where each row represents a sample. We’ll want to keep track of how many replicates each sample contains so that we may account for this in future analyses.

# New tibble for visit level wrangling,
visits.1 = select(events, VisitID, EventID) %>%
  summarise(Replicates = n_distinct(EventID), .by = VisitID) %>%
  right_join(select(events, -SiteID, -EventID), by = "VisitID") %>%
  distinct()
visits.1 %>% nrow()
## [1] 3034

4.2 Skimming visits

There are supposed to be 1770 visits, but we see more rows than that in our dataset. When we merged sites together, there must’ve been samples that contained mutliple entries for some of our variables (e.g., more than one Location or Habitat). Let’s skim the data to remind ourselves of our variables:

skim(visits.1)
Data summary
Name visits.1
Number of rows 3034
Number of columns 16
_______________________
Column type frequency:
character 8
Date 1
factor 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
VisitID 0 1.00 12 15 0 1770 0
Region 0 1.00 10 16 0 5 0
Location 1 1.00 8 135 0 406 0
Habitat 0 1.00 4 13 0 8 0
TidalStage 694 0.77 3 5 0 6 0
GearSpecific 0 1.00 8 12 0 15 0
ProjectName 0 1.00 4 31 0 31 0
PointOfContact 0 1.00 91 151 0 11 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 1976-05-29 2021-09-09 2006-06-14 1035

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Cluster 0 1 FALSE 436 140: 203, 143: 161, 136: 114, 348: 113

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Replicates 0 1.0 3.27 2.00 1.00 2.00 3.00 4.00 12.00 ▇▂▂▁▁
Lat 0 1.0 59.91 4.07 51.64 58.30 59.42 59.61 71.39 ▁▇▅▁▂
Lon 0 1.0 -145.37 9.93 -176.95 -151.71 -149.65 -134.95 -131.00 ▁▁▇▁▇
Temp_C 1208 0.6 9.84 3.53 -0.92 7.31 10.35 12.47 28.70 ▂▇▇▁▁
Salinity 1218 0.6 24.33 15.97 -400.00 20.18 26.00 30.35 73.50 ▁▁▁▁▇
MeshSize 0 1.0 4.60 2.91 3.00 3.00 3.20 6.00 12.70 ▇▁▁▁▁

Right away, we can rule out the variables that we’ve already dealt with and know are unique at the visit level: VisitID, Lat/Lon, Date, and Cluster.

We can also simply drop the information that we don’t need to keep in this version of the data, namely ProjectName, PointOfContact, and Gear Specific. I think we can also remove the TidalStage parameter, because I don’t care to account for it at this point.

# Update tibble to remove unnecessary site info,
visits.2 = select(visits.1, -c(ProjectName, PointOfContact, GearSpecific, TidalStage)) %>% distinct()
visits.2 %>% nrow()
## [1] 2766

Still not quite there. I bet a lot of visits contain multiple temps, salinity, habitats, and probably locations. We might want to use those later on, but for now we can just drop them because all their info is contained in our event level data object:

# Update tibble to remove 'problem' variables,
visits.3 = select(visits.2, -c(Region, Location, Habitat, Temperature = Temp_C, Salinity)) %>% distinct()
visits.3 %>% nrow()
## [1] 1867

Pretty darn close. Hopefully we don’t have samples with multiple mesh sizes… (spoiler) we do. Yuck.

4.3 The multiple gear-type problem

First, we isolate the cases where singular visits contain replicates of differing mesh sizes:

# New object for visits with multiple mesh size
mult.mesh = select(visits.3, VisitID, MeshSize) %>%
  summarise(`n Mesh Sizes` = n_distinct(MeshSize), .by = VisitID) %>%
  filter(`n Mesh Sizes` > 1)
# View
mult.mesh
## # A tibble: 63 × 2
##    VisitID         `n Mesh Sizes`
##    <chr>                    <int>
##  1 1083_2016-05-07              2
##  2 1083_2016-05-22              2
##  3 1083_2016-06-05              2
##  4 1083_2016-06-21              2
##  5 1083_2016-07-04              2
##  6 1083_2016-08-02              2
##  7 1083_2017-04-11              3
##  8 1083_2016-09-02              2
##  9 1083_2017-04-27              3
## 10 1083_2017-05-14              3
## # ℹ 53 more rows

Damn. Let’s get a better idea of how different these mesh sizes are, and what data sets they come from:

# Join multiple mesh events with other event level info,
mult.mesh.events = select(mult.mesh, VisitID) %>% left_join(events, by = "VisitID")
# Skim the project and date info,
select(mult.mesh.events, GearSpecific, ProjectName, Date) %>%
  mutate(across(c(GearSpecific, ProjectName), ~ as_factor(.x))) %>%
  skim()
Data summary
Name Piped data
Number of rows 382
Number of columns 3
_______________________
Column type frequency:
Date 1
factor 2
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 2016-05-06 2017-09-07 2017-04-30 63

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
GearSpecific 0 1 FALSE 3 BSE: 174, BSE: 166, BSE: 42
ProjectName 0 1 FALSE 1 Gla: 382
# Summary of gear and mesh size per visit,
select(mult.mesh.events, VisitID, GearSpecific, MeshSize) %>% 
  summarise(n_visits = n_distinct(VisitID), .by = c(GearSpecific, MeshSize))
## # A tibble: 3 × 3
##   GearSpecific MeshSize n_visits
##   <chr>           <dbl>    <int>
## 1 BSEINE-B          9.5       62
## 2 BSEINE-W         12.7       63
## 3 BSEINE-small      6.4       35

The good thing is that I recognize these data. The visits that contain different mesh size events come from the same project: EPSCoR Fire & Ice Lynn Canal group in Southeast AK. They reported that different mesh sizes had an effect on community composition using side-by-side comparison of gear types in their methods (Lundstrom et al. 2022). This explains why the cluster and grouping steps combined the events.

The bad news is that we’ll need to re-do previous wrangling steps on which mesh size is depoendent. This will involve separating the affected VisitID’s into separate samples based on mesh size, which will result in about 100 more unique VisitID’s and different mean abundance associated catches.

We’ll need to start by updating our events level data object. Since this involves overwriting a major data stage, I’ll be explicit in what I’m doing:

# Create a tibble to store the VisitID's to be fixed
mult.mesh.fix = select(mult.mesh.events, VisitID, EventID, GearSpecific, MeshSize) %>%
  # Create a unique suffix that identifies mesh size used per event
  mutate(suffix = gsub("BSEINE-", "", GearSpecific)) %>%
  # Add the suffix to VisitID, creating new VisitID's
  unite(col = VisitID_new, VisitID, suffix, sep = "_") %>%
  # Retain only the VisitID's and their EventID's
  select(EventID, VisitID_new)

# Update the 'events' tibble with fixed VisitID's,
events = left_join(select(events, EventID, VisitID), mult.mesh.fix, by = "EventID") %>%
  # Where VisitID's were not fixed, fill in with old VisitID's
  mutate(VisitID_new = ifelse(is.na(VisitID_new), VisitID, VisitID_new)) %>%
  # Remove old VisitID column
  select(-VisitID) %>%
  # Re-join events data (w/o the old VisitID's) with the new VisitID's,
  left_join(select(events, -VisitID), by = c("EventID")) %>%
  # Rename (this could be done earlier but seems clearer here)
  rename(VisitID = VisitID_new)

Next, we need to re-run the catch and visits steps that are based on the events data.

4.4 Re-creating catch summaries

We’ll need to redo the abundance and occurrence calculations because the number of visits has changed. Rare taxa will still be defined as those occurring in less than 5 visits (still equivalent to <0.25%). The frequency and abundance plots are not all that different, but I show updated catch summary tables below for each taxa level. Note that I highlight rare taxa in orange.

Family

Genus

Species

Now we can re-run the VisitID wrangling code from earlier in this section:

# Re-run visits.1 code,
visits.4 = select(events, VisitID, EventID) %>%
  summarise(Replicates = n_distinct(EventID), .by = VisitID) %>%
  right_join(select(events, -SiteID, -EventID), by = "VisitID") %>%
  distinct() %>%
  # Re-run visits.2 & visits.3 code,
  select(-c(ProjectName, PointOfContact, GearSpecific, TidalStage,
            Region, Location, Habitat, Temperature = Temp_C, Salinity)) %>%
  distinct()

4.5 Re-visiting sample frequency

We now have 1867 unique visits. Let’s double check that there is only one row per VisitID:

n_distinct(visits.4$VisitID) == nrow(visits.4)
## [1] TRUE

Let’s also review our events per visit frequency plot because that will have changed too.

The difference isn’t very noticeable- we end up with a central tendency that has slightly shifted left.

Last, we save our fixed visits tibble as a new object with a cleaner name, and then clean up the environment for next steps.

visits = visits.4

5 Clean up and save work

# Remove unwanted objects
rm(list = ls()[!ls() %in% c('data', 'events', 'catch', 'visits',
                            'fam.tbl', 'gen.tbl', 'sp.tbl')])

# Check sizes of the data objects
eapply(as.environment(-1), FUN = function(x) format(object.size(x), units = "Mb"))
## $gen.tbl
## [1] "0 Mb"
## 
## $data
## [1] "42.3 Mb"
## 
## $fam.tbl
## [1] "0 Mb"
## 
## $sp.tbl
## [1] "0 Mb"
## 
## $visits
## [1] "0.2 Mb"
## 
## $catch
## [1] "12.3 Mb"
## 
## $events
## [1] "0.8 Mb"
# The original 'data' import is quite large,
# but also unneeded bc the info is held in other objects, so let's remove it:
rm('data')

# Save what's left,
save(list = ls(), file = file.path(here(), "data", "NFA_wrangled.rda"))
---
title: "NOAA Nearshore Fish Atlas, Data Wrangle"
author: "Chris Guo"
date: "Last compiled on `r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: TRUE
    toc_depth: 2
    toc_float:
      collapsed: FALSE
      print: FALSE
    number_sections: TRUE
    code_download: TRUE
theme: "flatly"
---

```{r include = FALSE}
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(size = "scriptsize")
```

Built with R version `r getRversion()`.

# Introduction

The point of this document is to make open and share-able the research methods used for my dissertation on nearshore fish communities in Alaska working towards a PhD in marine biology at the University of Alaska Fairbanks. Here, I cover most steps of the initial data preparation for my second and third chapters concerning spatial and temporal distributions of nearshore fishes across the state. This and other project files can be accessed via the Kachemak Bay National Estuarine Research Reserve's github [nearshore fish repository](https://github.com/kbnerr/nearshore-fish). Following this document, next step exploratory data analyses referencing these data can be found at [this Rpubs site](https://rpubs.com/chguo1/1232673).

In this document I show the steps taken in wrangling the NOAA Nearshore Fish Atlas (NFA) database. This replaces the separate scripts (named "./FishAtlas\_#\_description.R") and combines them into a single, streamlined R markdown file. These R scripts will be archived in case I need to replicate older methods. Namely, I used to use **rgdal** package for some steps in the wrangling process but that package stopped being supported ca. Oct 2023, moving my workflow to the **sf** package.

Link to the NFA can be found here, <https://alaskafisheries.noaa.gov/mapping/sz/>. The entire database can be downloaded as a .csv file, which is what I am using. However, I have additional data files that were either shared with me by NFA content managers or modified from the NOAA NFA data directory for ease of reading into R. I will try to simplify this for others to replicate in the future.

# Wrangling site data

The idea with this initial step is to define the scale of our samples. Contributors to the NFA did not define their SiteID's and EventID's similar to each other, so we'll need to distinguish between SiteID vs EventID and make sure it's consistent throughout the data.

I define a SiteID as *a unique place in space*, i.e., all samples from the same 'beach' ('beach' will be defined later on). An EventID is *a unique pull of the seine*. So there can be single or multiple EventID's associated with a SiteID. Additionally, I introduce a VisitID, which adds a temporal distinction among SiteID's, i.e., *all events from the same beach on the same day*.

## Set up

Load required packages, define directory, set options, source scripts:

```{r}
# Packages
library(tidyverse)
library(lubridate)
library(here)
library(skimr)
library(sf)
library(DT)

# Directory
wd = here()
dirs = wd %>% list.files() %>% str_subset(pattern = "^README|^LICENSE|.md$|.Rproj$", negate = TRUE)
for (i in seq_along(dirs)) {
  name = str_replace_all(dirs[i], "^", "dir.")
  path = str_replace_all(dirs[i], "^", str_c(wd, "/"))
  assign(name, path)
  rm(name, path, i)
}

# Options

# Source

```

## Read in data

```{r}
data = read_csv(file.path(dir.data, "Raw Fish Atlas Sites 2024-10-16.csv"),
               show_col_types = FALSE)
GearLookup = read_csv(file.path(dir.data, "FishAtlasExpansion_040422_Lookup_Gear.csv"),
                      show_col_types = FALSE)
```

## Splitting site and species info

As in typical community data (site x species), we have info tied to site description and info tied to taxa. Notice that EventID is the 'look-up' key used in both data sets.

```{r}
events.1 = data %>%
  select(SiteID, EventID,
         Date,
         Lat, Lon = Long, Region, Location,
         Habitat, TidalStage, Temp_C, Salinity,
         GearSpecific, ProjectName, PointOfContact) %>%
  mutate(Date = mdy(Date)) %>% # re-format date
  distinct()
glimpse(events.1)

catch.1 = data %>%
  select(EventID,
         Sp_CommonName, Sp_ScientificName, Fam_CommonName, Fam_ScientificName,
         Unmeasured, Length_mm, LengthType, LifeStage)
glimpse(catch.1)
```

## Considering gear type

The NFA website makes it easy to filter for gear type, which I did prior to downloading the .csv file. However, we'll want to account for differences among the beach seines used- namely the net mesh size. This information is contained in the metadata for the NFA database and can be accessed by request from the database managers (file *FishAtlasExpansion_040422_Lookup_Gear.csv*). A few instances of missing mesh size information was provided by email request to the associated point of contact for that contributed data. Results from the below code are hidden but comments explain what decisions were made.

```{r results = 'hide'}
# Join the mesh size info from our gear data to the site information
gear.1 = left_join(events.1, select(GearLookup, GearSpecific, MeshSize), by = "GearSpecific")

# Subset for gear related info and take a look
gear.2 = select(gear.1, EventID, MeshSize, GearSpecific, ProjectName, PointOfContact)
glimpse(gear.2)
# How many different mesh sizes are there?
gear.2$MeshSize %>% unique()

# Looks like we have an NA to take care of...
filter(gear.2, is.na(MeshSize))
# These are all from the GOAIERP project (PI Olav Ormseth). The report from that project reports 6 mm stretched mesh.
# We'll just go ahead and fix it here instead of changing the GearLookup file,
gear.3 = mutate(gear.2, MeshSize = ifelse(is.na(MeshSize), 6, MeshSize))

# Check our mesh sizes again,
gear.3$MeshSize %>% unique()
# Great, no more NA's.

# We can now join this information with our site data,
events.2 = left_join(events.1, select(gear.3, EventID, MeshSize), by = "EventID")
```

## The "Site-Event" problem

The catch data is tied to unique EventID's which are associated with spatially explicit SiteID's (each SiteID has a unique lat/lon). Considering that the data has been contributed by multiple researchers/projects, we'll need to make sure we have consistent sites and events across the whole dataset. After that we can tackle the catch data. First, we take a look at all of the site data with **skimr::skim()**.

```{r}
skim(events.2)
```

In particular we want to get a sense of how EventIDs are associated with SiteIDs and other variables placing them in space and time.

```{r echo = FALSE}
events.2 %>% summarise(Events = n_distinct(EventID), .by = c(SiteID, Date)) %>%
  ggplot(data = ., aes(x = Events)) +
  geom_bar(aes(y = after_stat(count))) +
  geom_text(stat = 'Count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Frequency of events (replicates) per SiteID-Date pair (samples)")

events.2 %>% summarise(Events = n_distinct(EventID), .by = c(Date, Location)) %>%
  ggplot(data = ., aes(x = Events)) +
  geom_bar(aes(y = after_stat(count))) +
  geom_text(stat = 'Count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Frequency of events per Location-Date pair")
```

We see a lot of single-pull samples when events are grouped by SiteID and Date, i.e., a *visit*. However, we can't assume all of those 'visits' are comparable in scope. Projects differed in structure and purpose, so what one researcher would consider a new site, another researcher might call a replicate event to be combined with other events within a single sample. I call this the "Site-Event" problem.

We see that the distribution of events becomes less severe when grouped by Date and Location, which tells me that some of those single-pull samples actually should be aggregated to be comparable to other multi-pull samples. However, grouping by location may not be accurate because of the loose definition of 'Location' (the levels of Location also seem to vary in scale). Since location info is tied to SiteID, we can't just look at EventID's and lat/lon. Instead, we'll need to figure out a way to combine events at the SiteID level. Basically, we are trying to combine SiteID's that are currently unique but should have the same identifier, i.e., samples of the same beach.

## Solving the "Site-Event" problem

Here is where we need to decide on how to define a 'beach'. In other words, what do we consider the spatial scale distinguishing one sample from another? From my personal seining, one of my larger-distanced sites was near Anchor Point, AK, where we've seined up to 800 m between first and last replicate. So in my opinion, 1 km would be reasonably conservative estimate to link replicate seines. Imagine seining by skiff- you could hop between replicates \~1 km down beach and still consider it the same site. In the below code, I define clusters of SiteID's by grouping points that are within 1000 m of at least one other SiteID.

```{r}
# Make a new tibble containing SiteID geometries:
sites.sf = events.2 %>%
  select(SiteID, Lat, Lon) %>%
  # Create sf object from lat/lon:
  st_as_sf(., coords = c("Lon", "Lat"), crs = 4326) %>% # WGS84
  # Project into Alaska Albers 3338
  st_transform(crs = 3338) %>%
  distinct()

# Define a buffer of 1km around each point
sites.sf.buf = st_buffer(sites.sf, dist = 1000)

# Create a tibble with clusters of overlapping buffer areas 
sites.sf.cl = sites.sf.buf %>%
  st_union() %>% # unite buffer areas
  st_cast(to = "POLYGON") %>% # turn them into polygon features
  st_as_sf() %>% # return geometry set features to sfc
  rownames_to_column(var = "Cluster") %>% # create col for cluster groups
  mutate(Cluster = as.factor(Cluster)) # use factor class

# Assign SiteID points to buffer area clusters
sites.sf.cl.join = st_join(sites.sf, sites.sf.cl, left = TRUE)

# Drop sf class from joined data
sites.cl = st_drop_geometry(sites.sf.cl.join)

# See what we got:
glimpse(sites.cl)
```

We now have a simple classification variable that can be joined back with the site information.

```{r}
events.3 = left_join(events.2, sites.cl, by = "SiteID")
```

However, this only takes into account the spatial component of our larger goal of creating a unique VisitID identifier. Next, we'll account for the sample dates where SiteID's within a cluster match.

```{r}
# Create a date-cluster tibble containing location and event info to be combined,
orig = select(events.3, Date, Cluster, SiteID, EventID, Lat, Lon)

# Nest the df by Date and Cluster,
orig.nest = orig %>%
  group_by(Date, Cluster) %>%
  nest(Sites = SiteID,
       Events = EventID,
       Lats = Lat,
       Lons = Lon) %>%
  ungroup()
```

We essentially created a our new sample identifier (visit) in nesting by date (day) and cluster (beach). To bring the data back to the EventID level, we need to deal with samples that now have multiple SiteID's and locations, as well as preserve the EventID's associated with the original SiteID. This means we will be overwriting data, which is why I called the above subset 'orig' and the below 'new'.

```{r}
new = orig.nest %>% 
  rowwise() %>% # Operates a row-at-a-time (per sample)
  mutate(SiteID.cl = min(Sites), 
         Lat.mean = mean(Lats$Lat),
         Lon.mean = mean(Lons$Lon)) %>%
  select(-c(Sites, Lats, Lons)) %>% # Remove original info in nested variables
  unite(col = VisitID, SiteID.cl, Date, sep = "_", remove = FALSE) %>% # create VisitID
  select(-c(Date, SiteID.cl, Cluster)) %>% # Remove info now captured by VisitID
  unnest_longer(Events) %>% # Unnest to the EventID level
  unpack(cols = Events) # change tibble back to vector

# Join our new identifier to the rest of the site data,
events = left_join(new, events.3, by = "EventID") %>%
  mutate(Lat = Lat.mean, Lon = Lon.mean) %>% # Replace old lat/lon with new mean lat/lon
  select(-c(Lat.mean, Lon.mean))

# Check our our new data structure
head(events, 12)
```

We can see in first dozen rows how the locations now structured by VisitID compares to the original based on SiteID. To select an identifier for the new VisitID, I simply went with the lowest SiteID of the nest list (there were no wrong answers here). The lat/lon's to be combined are at most a handful kms apart, so I figure a simple mean would suffice. Notice we have unique lat/lon's at the VisitID level- this is because different visits to the same beach might have had a different number of replicates. For example, we see that VisitID 1_2001-07-24 involved the first five events (EventID's 15-19), whereas, VisitID 1_2002-03-25 had an additional two events.

OK, great. Lastly, let's return to our frequency plots showing the number events per sample:

```{r echo = FALSE}
# Plot events per SiteID-Date same as before,
events.3 %>% summarise(Events = n_distinct(EventID), .by = c(SiteID, Date)) %>%
  ggplot(data = ., aes(x = Events)) +
  geom_bar(aes(y = after_stat(count))) +
  geom_text(stat = 'Count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Frequency of events per sample in original data structure")
# ggsave("nfa_freq-Site&Date-by-seines.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# And events per visit
events %>% summarise(Events = n_distinct(EventID), .by = VisitID) %>%
  ggplot(data = ., aes(x = Events)) +
  geom_bar(aes(y = after_stat(count))) +
  geom_text(stat = 'Count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Frequency of events per sample after solving Site-Event problem")
# ggsave("nfa_freq-visits-by-seines.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

```

Wow. There are way less single-pull samples! Also, there are some interesting samples with replicates of 9, 10, and 12. Some stats on the number of events per visit:

-   mean = `r summarise(events, events = n_distinct(EventID), .by = VisitID) %>% select(-VisitID) %>% as_vector() %>% mean() %>% round(2)`

-   median = `r summarise(events, events = n_distinct(EventID), .by = VisitID) %>% select(-VisitID) %>% as_vector() %>% median() %>% round(2)`

-   sd = `r summarise(events, events = n_distinct(EventID), .by = VisitID) %>% select(-VisitID) %>% as_vector() %>% sd() %>% round(2)`

## Clean up

```{r}
# Clean up our environment for the next step:
rm(list = ls()[!ls() %in% c('events', 'catch.1', 'data')])

# Reset directory
wd = here()
dirs = wd %>% list.files() %>% str_subset(pattern = "^README|^LICENSE|.md$|.Rproj$", negate = TRUE)
for (i in seq_along(dirs)) {
  name = str_replace_all(dirs[i], "^", "dir.")
  path = str_replace_all(dirs[i], "^", str_c(wd, "/"))
  assign(name, path)
  rm(path, i)
}
```

# Wrangling catch data

Similar to our wrangle of site information, we want to hone the breadth of the catch data we use. At this stage, we don't want to make any *interpretations* about the catch. Instead, the goal here is to weed out objectively unneeded and suspicious cases and ensure the consistency of the data (e.g., are all possible species named actual names?).

## Skimming catch

```{r}
skim(catch.1)
```

One nice thing about **skim()** is that it quickly tells you where the information is missing and how much is missing. We see that there are nine cases of missing data at the family level. Something else that jumps out to me is that the number of unique species common names is not the same as the number of unique species scientific names.

I'll also change the 'Unmeasured' variable to a count abundance, which may not be totally necessary but will make it easier for me to work with the data.

Another thing to point out is that less than half of the data appears to have life stage information. Furthermore, the data with length information is a little over 50%. We expect some missing lengths because most projects will measure size for a subset of each species and count the rest. Typically, I expect 20-50 individuals sized before a row indicated the unmeasured count, which would result in a very low percentage of "missing" lengths (maybe 5% or less?). But missing lengths over 50% indicate some projects just didn't measure their catch, which is a shame because we could have inferred life stage from length. At this point, it doesn't seem worth while to toss out potentially half the data to include either of those variables. Although, I may return to this down the road.

First, let's check out all the taxa we have in the data.

## Taxa list - before edits {.tabset .tabset-pills}

### Family {.unnumbered}

```{r echo = FALSE}
catch.1 %>%
  select(Fam_ScientificName, Fam_CommonName) %>%
  rename(`Scientific Name` = Fam_ScientificName, `Common Name` = Fam_CommonName) %>%
  distinct() %>%
  arrange(`Scientific Name`) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

### Species {.unnumbered}

```{r echo = FALSE}
catch.1 %>%
  select(Sp_ScientificName, Sp_CommonName) %>%
  rename(`Scientific Name` = Sp_ScientificName, `Common Name` = Sp_CommonName) %>%
  distinct() %>%
  arrange(`Scientific Name`) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

## Juveniles and unidentified fish {.tabset .tabset-pills}

### Juveniles {.unnumbered}
```{r echo = FALSE}
# Juvenile species table
catch.1 %>%
  filter(Sp_CommonName %>% str_detect("juvenile")) %>%
  select(Sp_ScientificName, Sp_CommonName, Unmeasured, Length_mm, LifeStage) %>%
  rename(`Scientific Name` = Sp_ScientificName,
         `Common Name` = Sp_CommonName,
         `N Unmeasured` = Unmeasured,
         `Length (mm)` = Length_mm,
         `Life Stage` = LifeStage) %>%
  arrange(`Scientific Name`) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

### Unidentified {.unnumbered}

``` {r echo = FALSE}
# Unidentified species table
catch.1 %>%
  filter(Sp_CommonName %>% str_detect("unidentified")) %>%
  select(Sp_ScientificName, Sp_CommonName, Unmeasured, Length_mm, LifeStage) %>%
  rename(`Scientific Name` = Sp_ScientificName,
         `Common Name` = Sp_CommonName,
         `N Unmeasured` = Unmeasured,
         `Length (mm)` = Length_mm,
         `Life Stage` = LifeStage) %>%
  arrange(`Scientific Name`) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

### Both {.unnumbered}

```{r}
catch.1 %>%
  filter(Sp_CommonName %>% str_detect("juvenile|unidentified")) %>%
  select(Fam_ScientificName, Fam_CommonName, Sp_ScientificName, Sp_CommonName) %>%
  distinct() %>%
  arrange(Sp_CommonName) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```
## Fixing naming conventions

There is a frustrating mix of so called juveniles that are labelled YOY or Larval or YOY-Larval with inconsistent binning of lengths. For example, we have observations of juvenile sculpin at 9 mm with both YOY and larval labels in the data. The largest juvenile is a greenling at 460 mm. The next largest juvenile is a sculpin at 298 mm. Confusingly, there are also three cases of juvenile sculpin/salmonid that are labelled with adult life stage. Besides that, where fish are unmeasured and no life stage designated, we can probably trust that they are actually juveniles.

When we look at the observations that were labelled 'unidentified', we find some more issues to address. Unidentified cases usually come about because fishes were either too small to distinguish in the field (e.g., YOY morphologies), were unfamiliar to the sampler (e.g., rare species), or were taxa that are generally difficult to distinguish based on morphology (e.g., salmonids or cisco). The smallest unidentified fish with an adult life stage label is a 275 mm cisco "Coregonus species". The largest unidentified fish labelled larval is a 51 mm "Osmeridae or Clupeidae". This brings up an in naming convention of the unidentified fishes. For example, we have a general 'unidentified fish' but also 'unidentified fish larvae' and 'unidentified larval forage fish'. Which brings up another issue that fish are unidentified to differing taxonomic levels- sometimes genus or family or sub-family.

It doesn't make sense to me to have a juvenile distinction among species names nor have so many unique names among unidentified fish. The simplest fix in my opinion is just to be sure to use scientific names in analyses. We'll have to address the two cases using "or" in Sp_ScientificName first. I also think we should then remove all unidentified instances that are simply "Division Teleostei" because it is not informative for basic questions of taxonomy. After that, we may wish to aggregate certain groups to higher taxonomy based on how frequent or rarely they occur in the data. So we should create a genus class to preserve that information in case we decide to move to the family level later.

I do want to retain the 'Division Teleostei" data as a subset because I think it could be useful when asking certain questions. I also think we could return to the idea of categorizing species based on life stage by inferring missing information. It would be interesting to examine whether we could predict what those fishes might be based on co-occurring species, time of year, location, etc.

I do not show results of the below code, but I provide an edited taxa list and see also comments for steps taken.

```{r results = "hide"}
# Fix the "or" cases in Sp_ScientificName
catch.1a = catch.1 %>%
  # Find the 'Anisarchus or Lumpenus' species and replace with Fam_ScientificName
  mutate(Sp_ScientificName = ifelse(str_detect(Sp_ScientificName, pattern = "Anisarchus medius or Lumpenus fabricii"), Fam_ScientificName, Sp_ScientificName),
         # Replace the missing family info for "Osmeridae or Clupidae" with "Division Teloestei" and "bony fishes"
         Fam_CommonName = ifelse(is.na(Fam_CommonName), "bony fishes", Fam_CommonName),
         Fam_ScientificName = ifelse(is.na(Fam_ScientificName), "Division Teleostei", Fam_ScientificName),
         # Create genus class by taking first word of Sp_ScientificName
         Gen_ScientificName = word(Sp_ScientificName, 1))

# Remove all instances of "Division Teleostei" and save the df
catch.2 = filter(catch.1a, Fam_ScientificName != "Division Teleostei")

# Store the removed fishes as .rds in our data folder
filter(catch.1a, Fam_ScientificName == "Division Teleostei") %>%
  saveRDS(., file = file.path(dir.data, "nfa_teleostei.rds"))
```

## Taxa list - after edits {.tabset .tabset-pills}

### Family {.unnumbered}

```{r echo = FALSE}
catch.2 %>%
  select(Fam_ScientificName, Fam_CommonName) %>%
  rename(`Scientific Name` = Fam_ScientificName, `Common Name` = Fam_CommonName) %>%
  distinct() %>%
  arrange(`Scientific Name`) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

### Genus {.unnumbered}

```{r echo = FALSE}
catch.2 %>%
  select(`Scientific Name` = Gen_ScientificName) %>%
  distinct() %>%
  arrange(`Scientific Name`) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

### Species {.unnumbered}

```{r echo = FALSE}
catch.2 %>%
  select(Sp_ScientificName, Sp_CommonName) %>%
  rename(`Scientific Name` = Sp_ScientificName, `Common Name` = Sp_CommonName) %>%
  distinct() %>%
  arrange(`Scientific Name`) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

## Replacing 'Unmeasured' with 'Count'

The below code changes the 'Unmeasured' parameter into 'Count' abundance:

```{r results = "hide"}
catch.3 = mutate(catch.2, Count = ifelse(Unmeasured == 0, 1, Unmeasured)) %>%
  select(-Unmeasured) # remove unused parameter
```

Next, I'll make graphs and tables describing the abundance and occurrence of our catch:

## Visualizing catch {.tabset .tabset-pills}

### Family {.unnumbered}

```{r echo = FALSE, fig.width = 12, fig.height = 8}
# Family abundance
fam.abun = catch.3 %>%
  summarise(Abundance = sum(Count), .by = Fam_ScientificName) %>%
  mutate(Fam_ScientificName = fct_reorder(as.factor(Fam_ScientificName), desc(Abundance)))
# Plot
ggplot(dat = fam.abun, aes(x = Fam_ScientificName, y = Abundance)) +
  geom_col() +
  geom_text(aes(label = Abundance), vjust = -0.5, size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Family abundance', x = 'Family')
# ggsave("nfa_fam_abun_raw.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Family frequency of occurrence
fam.freq.occur = left_join(catch.3, select(events, VisitID, EventID), by = "EventID") %>%
  summarise(Presence = n_distinct(Fam_ScientificName), .by = c(VisitID, Fam_ScientificName)) %>%
  summarise(Occurrence = sum(Presence), .by = Fam_ScientificName) %>%
  mutate(Perc_Occurrence = round(Occurrence / n_distinct(events$VisitID) * 100, 2),
         Fam_ScientificName = fct_reorder(as.factor(Fam_ScientificName), desc(Perc_Occurrence)))
# Plot
ggplot(data = fam.freq.occur, aes(x = Fam_ScientificName, y = Perc_Occurrence)) +
  geom_col() +
  geom_text(aes(label = round(Perc_Occurrence, 1)), size = 2, vjust = -0.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title = 'Family frequency of occurrence', x = 'Family', y = 'Percent occurrence')
# ggsave("nfa_fam_freq-occur.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Table of family abundance and occurrence
full_join(fam.abun, fam.freq.occur, by = 'Fam_ScientificName') %>%
  select(`Scientific Name` = Fam_ScientificName, `Percent Occurrence` = Perc_Occurrence, Abundance) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

### Genus {.unnumbered}

```{r echo = FALSE, fig.width = 12, fig.height = 8}
# Genus abundance
gen.abun = catch.3 %>%
  summarise(Abundance = sum(Count), .by = Gen_ScientificName) %>%
  mutate(Gen_ScientificName = fct_reorder(as.factor(Gen_ScientificName), desc(Abundance)))
# Plot
ggplot(data = gen.abun, aes(x = Gen_ScientificName, y = Abundance)) +
  geom_col() +
  geom_text(aes(label = Abundance), vjust = -0.5, size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Genus abundance', x = 'Genus')
# ggsave("nfa_gen_abun_raw.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Genus frequency of occurrence
gen.freq.occur = left_join(catch.3, select(events, VisitID, EventID), by = "EventID") %>%
  summarise(Presence = n_distinct(Gen_ScientificName), .by = c(VisitID, Gen_ScientificName)) %>%
  summarise(Occurrence = sum(Presence), .by = Gen_ScientificName) %>%
  mutate(Perc_Occurrence = round(Occurrence / n_distinct(events$VisitID) * 100, 2),
         Gen_ScientificName = fct_reorder(as.factor(Gen_ScientificName), desc(Perc_Occurrence)))
# Plot 
ggplot(data = gen.freq.occur, aes(x = Gen_ScientificName, y = Perc_Occurrence)) +
  geom_col() +
  geom_text(aes(label = round(Perc_Occurrence, 1)), size = 2, vjust = -0.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title = 'Genus frequency of occurrence', x = 'Genus', y = 'Percent occurrence')
# ggsave("nfa_gen_freq-occur.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Table of genus abundance and occurrence
full_join(gen.abun, gen.freq.occur, by = 'Gen_ScientificName') %>%
  select(`Scientific Name` = Gen_ScientificName, `Percent Occurrence` = Perc_Occurrence, Abundance) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

### Species {.unnumbered}

```{r echo = FALSE, fig.width = 12, fig.height = 8}
# Species abundance
sp.abun = catch.3 %>%
  summarise(Abundance = sum(Count), .by = Sp_ScientificName) %>%
  mutate(Sp_ScientificName = fct_reorder(as.factor(Sp_ScientificName), desc(Abundance)))
# Plot
ggplot(data = sp.abun, aes(x = Sp_ScientificName, y = Abundance)) +
  geom_col() +
  geom_text(aes(label = Abundance), vjust = -0.5, size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Species abundance', x = 'Species')
# ggsave("nfa_sp_abun_raw.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Species frequency of occurrence
sp.freq.occur = left_join(catch.3, select(events, VisitID, EventID), by = "EventID") %>%
  summarise(Presence = n_distinct(Sp_ScientificName), .by = c(VisitID, Sp_ScientificName)) %>%
  summarise(Occurrence = sum(Presence), .by = Sp_ScientificName) %>%
  mutate(Perc_Occurrence = round(Occurrence / n_distinct(events$VisitID) * 100, 2),
         Sp_ScientificName = fct_reorder(as.factor(Sp_ScientificName), desc(Perc_Occurrence)))
# Plot
ggplot(data = sp.freq.occur, aes(x = Sp_ScientificName, y = Perc_Occurrence)) +
  geom_col() +
  geom_text(aes(label = round(Perc_Occurrence, 1)), size = 2, vjust = -0.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title = 'Species frequency of occurrence', x = 'Species', y = 'Percent occurrence')
# ggsave("nfa_sp_freq-occur.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Table of species abundance and occurrence
full_join(sp.abun, sp.freq.occur, by = 'Sp_ScientificName') %>%
  select(`Scientific Name` = Sp_ScientificName, `Percent Occurrence` = Perc_Occurrence, Abundance) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE))
```

## Defining rare taxa

Removing rare species is common when dealing with community data because of the affect that the most rare species would have in typical analyses to be conducted (e.g., analysis of variance). Depending on our research questions, we may not consider rare cases functional members of the community- in these cases it may make more sense to view the community as those taxa most well represented. But if I were more interested in richness, then I would want to conservatively retain rare taxa.

I think we can convince ourselves that removing single count and even single occurence species would be reasonable, because we do not have any information to let us know whether those cases represent an actual community member or whether they represent some sort of sampling error (e.g., mistakenly labelling all the salmonids in a particular sample Atlantic Salmon rather than a Pacific). Our research questions broadly deal with spatial and temporal distributions of community structure, so I think a 'middle of the road' approach is how we'll want to proceed (at least for now).

We'll want to be objective in how we define rare taxa. To me, it makes sense to consider occurrence rather than abundance. Our decision should balance the information lost vs retained, and consider how it affects each level of taxonomy. However, I do not know what taxa level makes most sense for future analyses at this point so we'll want to keep our options open. For now, let's consider it "rare" when taxa appear in less than 5 total visits (less than 0.25% percent occurrence). With that, removing rare taxa causes:

- 17.6% reduction in richness at the Family level (6 of 34)

- 25.6% reduction in richness at the Genus level (30 of 117)

- 31.4% reduction in richness at the Species Level (59 of 188)

Note that this fixed cut-off affects species diversity more so than family diversity. Some of this can be explained by the fact that the species list contains many identifications at the genus and family level, which makes sense since identification often includes cases where researchers aggregate 'iffy' ID's into a higher class. I expect this issue to be especially apparent in a collaborative dataset like the NFA.

## Lists of rare taxa {.tabset .tabset-pills}

### Family {.unnumbered}

```{r echo = FALSE}
# Subset rare families
rare.fam = full_join(fam.abun, fam.freq.occur, by = 'Fam_ScientificName') %>%
  filter(Perc_Occurrence < 0.25) %>%
  select(Fam_ScientificName, Perc_Occurrence, Abundance)
# Table of rare family abundance and occurrence
rename(rare.fam, `Scientific Name` = Fam_ScientificName, `Percent Occurrence` = Perc_Occurrence) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 250,
                           scroller = TRUE))
```

### Genus {.unnumbered}

```{r echo = FALSE}
# Subset rare genus
rare.gen = full_join(gen.abun, gen.freq.occur, by = 'Gen_ScientificName') %>%
  filter(Perc_Occurrence < 0.25) %>%
  select(Gen_ScientificName, Perc_Occurrence, Abundance)
# Table of rare genus abundance and occurrence
rename(rare.gen, `Scientific Name` = Gen_ScientificName, `Percent Occurrence` = Perc_Occurrence) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 250,
                           scroller = TRUE))
```

### Species {.unnumbered}

```{r echo = FALSE}
# Subset rare species
rare.sp = full_join(sp.abun, sp.freq.occur, by = 'Sp_ScientificName') %>%
  filter(Perc_Occurrence < 0.25) %>%
  select(Sp_ScientificName, Perc_Occurrence, Abundance)
# Table of rare genus abundance and occurrence
rename(rare.sp, `Scientific Name` = Sp_ScientificName, `Percent Occurrence` = Perc_Occurrence) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 250,
                           scroller = TRUE))
```

Cool- now let's re-visit our abundance and occurrence graphs to see how they look:

## Catch after removing rare taxa {.tabset .tabset-pills}

### Family {.unnumbered}

```{r echo = FALSE, fig.width = 12, fig.height = 8}
# Plot abundance
anti_join(fam.abun, rare.fam, by = "Fam_ScientificName") %>%
  ggplot(aes(x = Fam_ScientificName, y = Abundance)) +
  geom_col() +
  geom_text(aes(label = Abundance), vjust = -0.5, size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Family abundance excluding rare taxa', x = 'Family')
# ggsave("nfa_fam_abun_raw_ex-rare.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Plot frequency of occurrence
anti_join(fam.freq.occur, rare.fam, by = "Fam_ScientificName") %>%
  ggplot(aes(x = Fam_ScientificName, y = Perc_Occurrence)) +
  geom_col() +
  geom_text(aes(label = round(Perc_Occurrence, 1)), size = 2, vjust = -0.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title = 'Family frequency of occurrence excluding rare taxa', x = 'Family', y = 'Percent occurrence')
# ggsave("nfa_fam_freq-occur_ex-rare.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))
```

### Genus {.unnumbered}

```{r echo = FALSE, fig.width = 12, fig.height = 8}
# Plot abundance
anti_join(gen.abun, rare.gen) %>%
  ggplot(aes(x = Gen_ScientificName, y = Abundance)) +
  geom_col() +
  geom_text(aes(label = Abundance), vjust = -0.5, size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Genus abundance excluding rare taxa', x = 'Genus')
# ggsave("nfa_gen_abun_raw_ex-rare.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Plot frequency of occurrence
anti_join(gen.freq.occur, rare.gen) %>%
  ggplot(aes(x = Gen_ScientificName, y = Perc_Occurrence)) +
  geom_col() +
  geom_text(aes(label = round(Perc_Occurrence, 1)), size = 2, vjust = -0.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title = 'Genus frequency of occurrence excluding rare taxa', x = 'Genus', y = 'Percent occurrence')
# ggsave("nfa_gen_freq-occur_ex-rare.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))
```

### Species {.unnumbered}

```{r echo = FALSE, fig.width = 12, fig.height = 8}
# Plot abundance
anti_join(sp.abun, rare.sp) %>%
  ggplot(aes(x = Sp_ScientificName, y = Abundance)) +
  geom_col() +
  geom_text(aes(label = Abundance), vjust = -0.5, size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Species abundance excluding rare taxa', x = 'Species')
# ggsave("nfa_sp_abun_raw_ex-rare.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))

# Plot frequency of occurrence
anti_join(sp.freq.occur, rare.sp) %>%
  ggplot(aes(x = Sp_ScientificName, y = Perc_Occurrence)) +
  geom_col() +
  geom_text(aes(label = round(Perc_Occurrence, 1)), size = 2, vjust = -0.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title = 'Species frequency of occurrence excluding rare taxa', x = 'Species', y = 'Percent occurrence')
# ggsave("nfa_sp_freq-occur_ex-rare.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))
```

Last thing we'll do is rename our catch tibble and clean up our environment.

```{r}
# Re-assign working version of catch to a cleaner name
catch = catch.3
```

## Clean up

```{r}
# Clean up our environment for the next step:
rm(list = ls()[!ls() %in% c('data', 'events', 'catch', 'catch.fam', 'catch.gen', 'catch.sp', 'rare.families', 'rare.genus', 'rare.species')])

# Reset directory
wd = here()
dirs = wd %>% list.files() %>% str_subset(pattern = "^README|^LICENSE|.md$|.Rproj$", negate = TRUE)
for (i in seq_along(dirs)) {
  name = str_replace_all(dirs[i], "^", "dir.")
  path = str_replace_all(dirs[i], "^", str_c(wd, "/"))
  assign(name, path)
  rm(path, i)
}
```

# Wrangling visit data

## Set up

Remember that our samples operate at the visit level, where unique events represent replicates within each sample. Let's create a new tibble where each row represents a sample. We'll want to keep track of how many replicates each sample contains so that we may account for this in future analyses.

```{r}
# New tibble for visit level wrangling,
visits.1 = select(events, VisitID, EventID) %>%
  summarise(Replicates = n_distinct(EventID), .by = VisitID) %>%
  right_join(select(events, -SiteID, -EventID), by = "VisitID") %>%
  distinct()
visits.1 %>% nrow()
```

## Skimming visits

There are supposed to be `r visits.1$VisitID %>% n_distinct()` visits, but we see more rows than that in our dataset. When we merged sites together, there must've been samples that contained mutliple entries for some of our variables (e.g., more than one Location or Habitat). Let's skim the data to remind ourselves of our variables:

```{r}
skim(visits.1)
```

Right away, we can rule out the variables that we've already dealt with and know are unique at the visit level: VisitID, Lat/Lon, Date, and Cluster.

We can also simply drop the information that we don't need to keep in this version of the data, namely ProjectName, PointOfContact, and Gear Specific. I think we can also remove the TidalStage parameter, because I don't care to account for it at this point.

```{r}
# Update tibble to remove unnecessary site info,
visits.2 = select(visits.1, -c(ProjectName, PointOfContact, GearSpecific, TidalStage)) %>% distinct()
visits.2 %>% nrow()
```

Still not quite there. I bet a lot of visits contain multiple temps, salinity, habitats, and probably locations. We might want to use those later on, but for now we can just drop them because all their info is contained in our event level data object:

```{r}
# Update tibble to remove 'problem' variables,
visits.3 = select(visits.2, -c(Region, Location, Habitat, Temperature = Temp_C, Salinity)) %>% distinct()
visits.3 %>% nrow()
```

Pretty darn close. Hopefully we don't have samples with multiple mesh sizes... (spoiler) we do. Yuck.

## The multiple gear-type problem

First, we isolate the cases where singular visits contain replicates of differing mesh sizes:

```{r}
# New object for visits with multiple mesh size
mult.mesh = select(visits.3, VisitID, MeshSize) %>%
  summarise(`n Mesh Sizes` = n_distinct(MeshSize), .by = VisitID) %>%
  filter(`n Mesh Sizes` > 1)
# View
mult.mesh
```

Damn. Let's get a better idea of how different these mesh sizes are, and what data sets they come from:

```{r}
# Join multiple mesh events with other event level info,
mult.mesh.events = select(mult.mesh, VisitID) %>% left_join(events, by = "VisitID")
# Skim the project and date info,
select(mult.mesh.events, GearSpecific, ProjectName, Date) %>%
  mutate(across(c(GearSpecific, ProjectName), ~ as_factor(.x))) %>%
  skim()
# Summary of gear and mesh size per visit,
select(mult.mesh.events, VisitID, GearSpecific, MeshSize) %>% 
  summarise(n_visits = n_distinct(VisitID), .by = c(GearSpecific, MeshSize))
```

The good thing is that I recognize these data. The visits that contain different mesh size events come from the same project: EPSCoR Fire & Ice Lynn Canal group in Southeast AK. They reported that different mesh sizes had an effect on community composition using side-by-side comparison of gear types in their methods ([Lundstrom et al. 2022](https://doi.org/10.1007/s12237-022-01057-x)). This explains why the cluster and grouping steps combined the events.

The bad news is that we'll need to re-do previous wrangling steps on which mesh size is depoendent. This will involve separating the affected VisitID's into separate samples based on mesh size, which will result in about 100 more unique VisitID's and different mean abundance associated catches.

We'll need to start by updating our events level data object. Since this involves overwriting a major data stage, I'll be explicit in what I'm doing:

```{r}
# Create a tibble to store the VisitID's to be fixed
mult.mesh.fix = select(mult.mesh.events, VisitID, EventID, GearSpecific, MeshSize) %>%
  # Create a unique suffix that identifies mesh size used per event
  mutate(suffix = gsub("BSEINE-", "", GearSpecific)) %>%
  # Add the suffix to VisitID, creating new VisitID's
  unite(col = VisitID_new, VisitID, suffix, sep = "_") %>%
  # Retain only the VisitID's and their EventID's
  select(EventID, VisitID_new)

# Update the 'events' tibble with fixed VisitID's,
events = left_join(select(events, EventID, VisitID), mult.mesh.fix, by = "EventID") %>%
  # Where VisitID's were not fixed, fill in with old VisitID's
  mutate(VisitID_new = ifelse(is.na(VisitID_new), VisitID, VisitID_new)) %>%
  # Remove old VisitID column
  select(-VisitID) %>%
  # Re-join events data (w/o the old VisitID's) with the new VisitID's,
  left_join(select(events, -VisitID), by = c("EventID")) %>%
  # Rename (this could be done earlier but seems clearer here)
  rename(VisitID = VisitID_new)
```

Next, we need to re-run the catch and visits steps that are based on the events data.

## Re-creating catch summaries {.tabset .tabset-pills}

We'll need to redo the abundance and occurrence calculations because the number of visits has changed. Rare taxa will still be defined as those occurring in less than 5 visits (still equivalent to <0.25%). The frequency and abundance plots are not all that different, but I show updated catch summary tables below for each taxa level. Note that I highlight rare taxa in orange.

### Family {.unnumbered}

```{r echo = FALSE}
## Re-summarize catch
# Abundance
fam.abun = summarise(catch, Abundance = sum(Count), .by = Fam_ScientificName)
# Frequency of occurrence
fam.freq.occur = left_join(catch, select(events, VisitID, EventID), by = "EventID") %>%
  summarise(Presence = n_distinct(Fam_ScientificName), .by = c(VisitID, Fam_ScientificName)) %>%
  summarise(Occurrence = sum(Presence), .by = Fam_ScientificName) %>%
  mutate(Perc_Occurrence = round(Occurrence / n_distinct(events$VisitID) * 100, 2))
# Combine calculated variables and define rare taxa
fam.tbl = full_join(fam.abun, fam.freq.occur, by = 'Fam_ScientificName') %>%
  mutate(Rare = ifelse(Occurrence < 5, "Yes", "No"))
# Save rare taxa as vector
rare.families = filter(fam.tbl, Rare == "Yes") %>% pull(Fam_ScientificName)
# Summary table
fam.tbl %>%
  select(`Scientific Name` = Fam_ScientificName, `Percent Occurrence` = Perc_Occurrence, Abundance, Rare) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE,
                           columnDefs = list(list(targets = 'Rare', visible = FALSE)))) %>% 
  formatStyle(columns = "Rare",
              target = "row",
              backgroundColor = styleEqual("Yes", "goldenrod"))
```

### Genus {.unnumbered}

``` {r echo = FALSE}
## Re-summarize catch
# Abundance
gen.abun = summarise(catch, Abundance = sum(Count), .by = Gen_ScientificName)
# Frequency of occurrence
gen.freq.occur = left_join(catch, select(events, VisitID, EventID), by = "EventID") %>%
  summarise(Presence = n_distinct(Gen_ScientificName), .by = c(VisitID, Gen_ScientificName)) %>%
  summarise(Occurrence = sum(Presence), .by = Gen_ScientificName) %>%
  mutate(Perc_Occurrence = round(Occurrence / n_distinct(events$VisitID) * 100, 2))
# Combine calculated variables and define rare taxa
gen.tbl = full_join(gen.abun, gen.freq.occur, by = 'Gen_ScientificName') %>%
  mutate(Rare = ifelse(Occurrence < 5, "Yes", "No"))
# Save rare taxa as vector
rare.genus = filter(gen.tbl, Rare == "Yes") %>% pull(Gen_ScientificName)
# Summary table
gen.tbl %>%
  select(`Scientific Name` = Gen_ScientificName, `Percent Occurrence` = Perc_Occurrence, Abundance, Rare) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE,
                           columnDefs = list(list(targets = 'Rare', visible = FALSE)))) %>% 
  formatStyle(columns = "Rare",
              target = "row",
              backgroundColor = styleEqual("Yes", "goldenrod"))
```

### Species {.unnumbered}

``` {r echo = FALSE}
## Re-summarize catch
# Abundance
sp.abun = summarise(catch, Abundance = sum(Count), .by = c(Sp_ScientificName, Sp_CommonName))
# Frequency of occurrence
sp.freq.occur = left_join(catch, select(events, VisitID, EventID), by = "EventID") %>%
  summarise(Presence = n_distinct(Sp_ScientificName), .by = c(VisitID, Sp_ScientificName)) %>%
  summarise(Occurrence = sum(Presence), .by = Sp_ScientificName) %>%
  mutate(Perc_Occurrence = round(Occurrence / n_distinct(events$VisitID) * 100, 2))
# Combine calculated variables and define rare taxa
sp.tbl = full_join(sp.abun, sp.freq.occur, by = 'Sp_ScientificName') %>%
  mutate(Rare = ifelse(Occurrence < 5, "Yes", "No"))
# Save rare taxa as vector
rare.species = filter(sp.tbl, Rare == "Yes") %>% pull(Sp_ScientificName)
# Summary table
sp.tbl %>%
  select(`Scientific Name` = Sp_ScientificName, `Common Name` = Sp_CommonName, `Percent Occurrence` = Perc_Occurrence, Abundance, Rare) %>%
  arrange(as.character(`Scientific Name`)) %>%
  datatable(rownames = FALSE,
            extensions = 'Scroller',
            options = list(deferRender = TRUE,
                           scrollY = 500,
                           scroller = TRUE,
                           columnDefs = list(list(targets = 'Rare', visible = FALSE)))) %>% 
  formatStyle(columns = "Rare",
              target = "row",
              backgroundColor = styleEqual("Yes", "goldenrod"))
```

Now we can re-run the VisitID wrangling code from earlier in this section:

```{r}
# Re-run visits.1 code,
visits.4 = select(events, VisitID, EventID) %>%
  summarise(Replicates = n_distinct(EventID), .by = VisitID) %>%
  right_join(select(events, -SiteID, -EventID), by = "VisitID") %>%
  distinct() %>%
  # Re-run visits.2 & visits.3 code,
  select(-c(ProjectName, PointOfContact, GearSpecific, TidalStage,
            Region, Location, Habitat, Temperature = Temp_C, Salinity)) %>%
  distinct()
```

## Re-visiting sample frequency

We now have `r visits.4$VisitID %>% n_distinct()` unique visits. Let's double check that there is only one row per VisitID:

```{r}
n_distinct(visits.4$VisitID) == nrow(visits.4)
```

Let's also review our events per visit frequency plot because that will have changed too.

```{r echo = FALSE}
# And events per visit
events %>% summarise(Events = n_distinct(EventID), .by = VisitID) %>%
  ggplot(data = ., aes(x = Events)) +
  geom_bar(aes(y = after_stat(count))) +
  geom_text(stat = 'Count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Frequency of events per sample after solving multiple gear-type problem")
# ggsave("nfa_freq-events-per-sample.png", plot = last_plot(), device = 'png', path = file.path(dir.figs))
```

The difference isn't very noticeable- we end up with a central tendency that has slightly shifted left.

Last, we save our fixed visits tibble as a new object with a cleaner name, and then clean up the environment for next steps.

```{r}
visits = visits.4
```

# Clean up and save work

```{r}
# Remove unwanted objects
rm(list = ls()[!ls() %in% c('data', 'events', 'catch', 'visits',
                            'fam.tbl', 'gen.tbl', 'sp.tbl')])

# Check sizes of the data objects
eapply(as.environment(-1), FUN = function(x) format(object.size(x), units = "Mb"))
# The original 'data' import is quite large,
# but also unneeded bc the info is held in other objects, so let's remove it:
rm('data')

# Save what's left,
save(list = ls(), file = file.path(here(), "data", "NFA_wrangled.rda"))
```